Abstract

We study the approximate string matching and regular expression matching problems for
the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary
compression schemes. We present a time-space trade-off that leads to algorithms improving the
previously known complexities for both problems. In particular, we significantly improve the
space bounds, which in practical applications are likely to be a bottleneck.

1 Introduction

Modern text databases, e.g., for biological and World Wide Web data, are huge. To save time and
space, it is desirable if data can be kept in compressed form and still allow efficient searching.
Motivated by this, Amir and Benson [2,3] initiated the study of compressed pattern matching problems,
that is, given a text string $Q$ in compressed form $Z$ and a specified (uncompressed) pattern $P$, find
all occurrences of $P$ in $Q$ without decompressing $Z$. The goal is to search more efficiently than
the naive approach of decompressing $Z$ into $Q$ and then searching for $P$ in $Q$. Various compressed
pattern matching algorithms have been proposed depending on the type of pattern and compression
method, see e.g., [3,9,11,12,14,19]. For instance, given a string $Q$ of length $u$ compressed with
the Ziv-Lempel-Welch scheme [24] into a string of length $n$, Amir et al. [4] gave an algorithm for
finding all exact occurrences of a pattern string of length $m$ in $O(n + m^2)$ time and space.

In this paper we study the classical approximate string matching and regular expression matching
problems in the context of compressed texts. As in previous work on these problems [11,19] we
focus on the popular ZL78 and ZLW adaptive dictionary compression schemes [24,26]. We present
a new technique that gives a general time-space trade-off. The resulting algorithms improve all
previously known complexities for both problems. In particular, we significantly improve the space
bounds. When searching large text databases, space is likely to be a bottleneck and therefore this
is of crucial importance.
Pattern Matching, 2007.

IT University of Copenhagen, Rued Langgaards Vej 7, 2300 Copenhagen S, Denmark. Email: beetle@itu.dk.
University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark. Email: rolf@imada.sdu.dk.
Technical University of Denmark, Building 322, 2800 Kgs. Lyngby, Denmark. Email: ilg@iimn.dtu.dk.



1.1 Approximate String Matching 

Given strings $P$ and $Q$ and an error threshold $k$, the classical approximate string matching problem
is to find all ending positions of substrings of $Q$ whose edit distance to $P$ is at most $k$. The edit
distance between two strings is the minimum number of insertions, deletions, and substitutions
needed to convert one string to the other. The classical dynamic programming solution due to
Sellers [22] solves the problem in $O(um)$ time and $O(m)$ space, where $u$ and $m$ are the lengths
of $Q$ and $P$, respectively. Several improvements of this result are known, see e.g., the survey by
Navarro [18]. For this paper we are particularly interested in the fast solutions for small values of $k$,
namely, the $O(uk)$ time algorithm by Landau and Vishkin [13] and the more recent $O(uk^4/m + u)$
time algorithm due to Cole and Hariharan [7] (we assume w.l.o.g. that $k < m$). Both of these can
be implemented in $O(m)$ space.

Recently, Kärkkäinen et al. [11] studied this problem for text compressed with the ZL78/ZLW
compression schemes. If $n$ is the length of the compressed text, their algorithm achieves
$O(nmk + occ)$ time and $O(nmk)$ space, where $occ$ is the number of occurrences of the pattern.
Currently, this is the only non-trivial worst-case bound for the general problem on compressed texts.
For special cases and restricted versions, other algorithms have been proposed [15,21]. An
experimental study of the problem and an optimized practical implementation can be found in [20].

In this paper, we show that the problem is closely connected to the uncompressed problem and
we achieve a simple time-space trade-off. More precisely, let $t(m,u,k)$ and $s(m,u,k)$ denote the
time and space, respectively, needed by any algorithm to solve the (uncompressed) approximate
string matching problem with error threshold $k$ for pattern and text of length $m$ and $u$, respectively.
We show the following result.

Theorem 1 Let $Q$ be a string compressed using ZL78 into a string $Z$ of length $n$ and let $P$ be a
pattern of length $m$. Given $Z$, $P$, and a parameter $\tau \ge 1$, we can find all approximate occurrences
of $P$ in $Q$ with at most $k$ errors in $O(n(\tau + m + t(m, 2m + 2k, k)) + occ)$ expected time and
$O(n/\tau + m + s(m, 2m + 2k, k) + occ)$ space.

The expectation is due to hashing and can be removed at an additional $O(n)$ space cost. In this case
the bounds also hold for ZLW compressed strings. We assume that the algorithm for the uncompressed
problem produces the matches in sorted order (as is the case for all algorithms that we are aware
of). Otherwise, additional time for sorting must be included in the bounds. To compare Theorem 1
with the result of Kärkkäinen et al. [11], plug in the Landau-Vishkin algorithm and set $\tau = mk$.
This gives an algorithm using $O(nmk + occ)$ time and $O(n/(mk) + m + occ)$ space. This matches the
best known time bound while improving the space by a factor $\Theta(m^2k^2)$. Alternatively, if we plug
in the Cole-Hariharan algorithm and set $\tau = k^4 + m$ we get an algorithm using $O(nk^4 + nm + occ)$
time and $O(n/(k^4 + m) + m + occ)$ space. Whenever $k = O(m^{1/4})$ this is $O(nm + occ)$ time and
$O(n/m + m + occ)$ space.

To the best of our knowledge, all previous non-trivial compressed pattern matching algorithms
for ZL78/ZLW compressed text, with the exception of a very slow algorithm for exact string matching
by Amir et al. [4], use $\Omega(n)$ space. This is because the algorithms explicitly construct the dictionary
trie of the compressed texts. Surprisingly, our results show that for the ZL78 compression scheme
this is not needed to get an efficient algorithm. Conversely, if very little space is available our
trade-off shows that it is still possible to solve the problem without decompressing the text.

1.2 Regular Expression Matching

Given a regular expression $R$ and a string $Q$, the regular expression matching problem is to find
all ending positions of substrings of $Q$ that match a string in the language denoted by $R$. The
classic textbook solution to this problem due to Thompson [23] solves the problem in $O(um)$ time
and $O(m)$ space, where $u$ and $m$ are the lengths of $Q$ and $R$, respectively. Improvements based on
the Four Russian Technique or word-level parallelism are given in [5,6,17].

The only solution to the compressed problem is due to Navarro [19]. His solution depends on
word RAM techniques to encode small sets into memory words, thereby allowing constant time
set operations. On a unit-cost RAM with $w$-bit words this technique can be used to improve an
algorithm by at most a factor $O(w)$. For $w = O(\log u)$ a similar improvement is straightforward to
obtain for our algorithm and we will therefore, for the sake of exposition, ignore this factor in the
bounds presented below. With this simplification Navarro's algorithm uses $O(nm^2 + occ \cdot m \log m)$
time and $O(nm^2)$ space, where $n$ is the length of the compressed string. In this paper we show the
following time-space trade-off:

Theorem 2 Let $Q$ be a string compressed using ZL78 or ZLW into a string $Z$ of length $n$ and let $R$
be a regular expression of length $m$. Given $Z$, $R$, and a parameter $\tau \ge 1$, we can find all occurrences
of substrings matching $R$ in $Q$ in $O(nm(m + \tau) + occ \cdot m \log m)$ time and $O(nm^2/\tau + nm)$ space.

If we choose $\tau = m$ we obtain an algorithm using $O(nm^2 + occ \cdot m \log m)$ time and $O(nm)$ space.
This matches the best known time bound while improving the space by a factor $\Theta(m)$. With
word-parallel techniques these bounds can be improved slightly. The full details are given in Section 4.5.

1.3 Techniques 

If pattern matching algorithms for ZL78 or ZLW compressed texts use $\Omega(n)$ working space they can
explicitly store the dictionary trie for the compressed text and apply any linear space data structure
to it. This has proven to be very useful for compressed pattern matching. However, as noted by
Amir et al. [4], $\Omega(n)$ working space may not be feasible for large texts and therefore more
space-efficient algorithms are needed. Our main technical contribution is a simple $o(n)$ data structure
for ZL78 compressed texts. The data structure gives a way to compactly represent a subset of the
trie which combined with the compressed text enables algorithms to quickly access relevant parts
of the trie. This provides a general approach to solve compressed pattern matching problems in
$o(n)$ space, which combined with several other techniques leads to the above results.

2 The Ziv-Lempel Compression Schemes 

Let $\Sigma$ be an alphabet containing $\sigma = |\Sigma|$ characters. A string $Q$ is a sequence of characters from
$\Sigma$. The length of $Q$ is $u = |Q|$ and the unique string of length $0$ is denoted $\epsilon$. The $i$th character of
$Q$ is denoted $Q[i]$ and the substring beginning at position $i$ of length $j - i + 1$ is denoted $Q[i,j]$.
The Ziv-Lempel algorithm from 1978 [26] provides a simple and natural way to represent strings,
which we describe below. Define a ZL78 compressed string (abbreviated compressed string in the
remainder of the paper) to be a string of the form

$$Z = z_1 \cdots z_n = (r_1, \alpha_1)(r_2, \alpha_2) \cdots (r_n, \alpha_n),$$

$Q$ = ananas

$Z = (0,a)(0,n)(1,n)(1,s)$

Figure 1: The compressed string $Z$ representing $Q$ and the corresponding dictionary trie $D$. Taken
from [19].

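where $r_i \in \{0, \ldots, i-1\}$ and $\alpha_i \in \Sigma$. Each pair $z_i = (r_i, \alpha_i)$ is a compression element, and $r_i$
and $\alpha_i$ are the reference and label of $z_i$, denoted by $\mathrm{reference}(z_i)$ and $\mathrm{label}(z_i)$, respectively. Each
compression element represents a string, called a phrase. The phrase for $z_i$, denoted $\mathrm{phrase}(z_i)$, is
given by the following recursion.

$$\mathrm{phrase}(z_i) = \begin{cases} \mathrm{label}(z_i) & \text{if } \mathrm{reference}(z_i) = 0, \\ \mathrm{phrase}(\mathrm{reference}(z_i)) \cdot \mathrm{label}(z_i) & \text{otherwise.} \end{cases}$$

The $\cdot$ denotes concatenation of strings. The compressed string $Z$ represents the concatenation of
the phrases, i.e., the string $\mathrm{phrase}(z_1) \cdots \mathrm{phrase}(z_n)$.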

Let $Q$ be a string of length $u$. In ZL78, the compressed string representing $Q$ is obtained by
greedily parsing $Q$ from left-to-right with the help of a dictionary $D$. For simplicity in the
presentation we assume the existence of an initial compression element $z_0 = (0,\epsilon)$ where $\mathrm{phrase}(z_0) = \epsilon$.
Initially, let $z_0 = (0,\epsilon)$ and let $D = \{\epsilon\}$. After step $i$ we have computed a compressed string
$z_0 z_1 \cdots z_i$ representing $Q[1,j]$ and $D = \{\mathrm{phrase}(z_0), \ldots, \mathrm{phrase}(z_i)\}$. We then find the longest
prefix of $Q[j+1, u-1]$ that matches a string in $D$, say $\mathrm{phrase}(z_k)$, and let
$\mathrm{phrase}(z_{i+1}) = \mathrm{phrase}(z_k) \cdot Q[j + 1 + |\mathrm{phrase}(z_k)|]$. Set $D = D \cup \{\mathrm{phrase}(z_{i+1})\}$ and let
$z_{i+1} = (k, Q[j + 1 + |\mathrm{phrase}(z_k)|])$. The compressed string $z_0 z_1 \cdots z_{i+1}$ now represents the
string $Q[1, j + |\mathrm{phrase}(z_{i+1})|]$ and $D = \{\mathrm{phrase}(z_0), \ldots, \mathrm{phrase}(z_{i+1})\}$. We repeat this
process until all of $Q$ has been read.
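
The greedy parse can be illustrated with a minimal Python sketch. It is not the space-efficient representation developed later: as a simplifying assumption it keeps the dictionary $D$ as a hash map from phrase strings to element indices, and the function names `zl78_compress` and `zl78_decompress` are hypothetical. Strings are 0-indexed in the code, whereas the text uses 1-indexed positions.

```python
def zl78_compress(Q):
    """Greedy ZL78 parse of Q; returns [z_0, z_1, ..., z_n] with z_0 = (0, '')."""
    Z = [(0, "")]
    index = {"": 0}          # phrase -> element index; plays the role of the dictionary D
    j = 0                    # number of characters of Q consumed so far
    while j < len(Q):
        k, length = 0, 0
        # Extend the match while the longer prefix is a phrase and a label character remains.
        while j + length + 1 < len(Q) and Q[j:j + length + 1] in index:
            length += 1
            k = index[Q[j:j + length]]
        label = Q[j + length]
        # setdefault: the very last phrase may repeat an earlier one.
        index.setdefault(Q[j:j + length + 1], len(Z))
        Z.append((k, label))
        j += length + 1
    return Z


def zl78_decompress(Z):
    """Rebuild Q by expanding every phrase as phrase(reference) + label."""
    phrases = [""]
    for ref, label in Z[1:]:
        phrases.append(phrases[ref] + label)
    return "".join(phrases)
```

On the string ananas this produces the elements $(0,a)(0,n)(1,n)(1,s)$ of Figure 1, preceded by the artificial root element $z_0$.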

Since each phrase is the concatenation of a previous phrase and a single character, the dictionary
$D$ is prefix-closed, i.e., any prefix of a phrase is also a phrase. Hence, we can represent it
compactly as a trie where each node $i$ corresponds to a compression element $z_i$ and $\mathrm{phrase}(z_i)$ is
the concatenation of the labels on the path from $z_0$ to node $i$. Due to greediness, the phrases are
unique and therefore the number of nodes in $D$ for a compressed string $Z$ of length $n$ is $n + 1$. An
example of a string and the corresponding compressed string is given in Fig. 1.

Throughout the paper we will identify compression elements with nodes in the trie $D$, and
therefore we use standard tree terminology, briefly summed up here: The distance between two
elements is the number of edges on the unique simple path between them. The depth of element $z$
is the distance from $z$ to $z_0$ (the root of the trie). An element $x$ is an ancestor of an element $z$ if
$\mathrm{phrase}(x)$ is a prefix of $\mathrm{phrase}(z)$. If also $|\mathrm{phrase}(x)| = |\mathrm{phrase}(z)| - 1$ then $x$ is the parent of $z$. If
$x$ is an ancestor of $z$ then $z$ is a descendant of $x$ and if $x$ is the parent of $z$ then $z$ is the child of $x$. The
length of a path $p$ is the number of edges on the path, and is denoted $|p|$. The label of a path is
the concatenation of the labels on these edges.

Note that for a compression element $z$, $\mathrm{reference}(z)$ is a pointer to the parent of $z$ and $\mathrm{label}(z)$
is the label of the edge to the parent of $z$. Thus, given $z$ we can use the compressed text $Z$ directly
to decode the label of the path from $z$ towards the root in constant time per element. We will use
this important property in many of our results.
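
A short sketch of this property, under the assumption that $Z$ is held as a Python list of (reference, label) pairs with the root $z_0$ at index 0 (the helper name `path_label` is hypothetical): following references yields the labels from $z$ towards the root, and reversing them gives a suffix of the phrase without building the trie.

```python
def path_label(Z, i, length=None):
    """Labels on the reference path from z_i towards the root, one step per element.

    If length is None the walk continues to the root z_0; otherwise it stops
    after `length` edges. The labels are returned in root-to-leaf order, so the
    full walk yields phrase(z_i).
    """
    labels = []
    while i != 0 and (length is None or len(labels) < length):
        ref, label = Z[i]
        labels.append(label)   # label of the edge from z_i to its parent
        i = ref                # reference(z_i) is the parent pointer
    return "".join(reversed(labels))


Z = [(0, ""), (0, "a"), (0, "n"), (1, "n"), (1, "s"),
     (0, "b"), (3, "a"), (2, "e"), (0, "r")]
```

For example, `path_label(Z, 6)` decodes the phrase of $z_6 = (3,a)$ by visiting $z_6$, $z_3$, and $z_1$.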

If the dictionary $D$ is implemented as a trie it is straightforward to compress $Q$ or decompress
$Z$ in $O(u)$ time. Furthermore, if we do not want to explicitly decompress $Z$ we can compute the
trie in $O(n)$ time, and as mentioned above, this is done in almost all previous compressed pattern
matching algorithms on Ziv-Lempel compression schemes. However, this requires at least $\Omega(n)$ space
which is insufficient to achieve our bounds. In the next section we show how to partially represent
the trie in less space.

2.1 Selecting Compression Elements 

Let $Z = z_0 \cdots z_n$ be a compressed string. For our results we need an algorithm to select a compact
subset of the compression elements such that the distance from any element to an element in the
subset is no larger than a given threshold. More precisely, we show the following lemma.

Lemma 1 Let $Z$ be a compressed string of length $n$ and let $1 \le \tau \le n$ be a parameter. There is a
set of compression elements $C$ of $Z$, computable in $O(n\tau)$ expected time and $O(n/\tau)$ space with the
following properties:

(i) $|C| = O(n/\tau)$.

(ii) For any compression element $z_i$ in $Z$, the minimum distance to any compression element in
$C$ is at most $2\tau$.

Proof. Let $1 \le \tau \le n$ be a given parameter. We build $C$ incrementally in a left-to-right scan of
$Z$. The set is maintained as a dynamic dictionary using dynamic perfect hashing [8], i.e., constant
time worst-case access and constant time amortized expected update. Initially, we set $C = \{z_0\}$.
Suppose that we have read $z_0, \ldots, z_i$. To process $z_{i+1}$ we follow the path $p$ of references until we
encounter an element $y$ such that $y \in C$. We call $y$ the nearest special element of $z_{i+1}$. Let $l$ be the
number of elements in $p$ including $z_{i+1}$ and $y$. Since each lookup in $C$ takes constant time the time
to find the nearest special element is $O(l)$. If $l < 2\tau$ we are done. Otherwise, if $l = 2\tau$, we find
the $\tau$th element $y'$ in the reference path and set $C := C \cup \{y'\}$. As the trie grows under addition of
leaves condition (ii) follows. Moreover, any element chosen to be in $C$ has at least $\tau$ descendants of
distance at most $\tau$ that are not in $C$ and therefore condition (i) follows. The time for each step is
$O(\tau)$ amortized expected and therefore the total time is $O(n\tau)$ expected. The space is proportional
to the size of $C$ hence the result follows. □
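
The construction in the proof can be sketched as follows. This is an illustration only: a built-in Python `set` stands in for the dynamic perfect hashing of [8], $Z$ is assumed to be a list of (reference, label) pairs with $z_0$ at index 0, and the name `select_elements` is hypothetical.

```python
def select_elements(Z, tau):
    """Return the index set C of Lemma 1 for the compressed string Z."""
    C = {0}                                  # initially C = {z_0}
    for i in range(1, len(Z)):
        # Follow references from z_i until the nearest special element y in C.
        path = [i]
        while path[-1] not in C:
            path.append(Z[path[-1]][0])
        # len(path) counts the elements on the path, including z_i and y.
        if len(path) == 2 * tau:
            C.add(path[tau - 1])             # the tau-th element on the path
    return C


def distance_to_C(Z, C, i):
    """Number of edges from z_i to its nearest ancestor in C."""
    d = 0
    while i not in C:
        i, d = Z[i][0], d + 1
    return d


Z = [(0, ""), (0, "a"), (0, "n"), (1, "n"), (1, "s"),
     (0, "b"), (3, "a"), (2, "e"), (0, "r")]
```

With $\tau = 2$ on this example only $z_3$ is added to the initial set $\{z_0\}$, since $z_6$ is the first element whose reference path contains $2\tau = 4$ elements.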



2.2 Other Ziv-Lempel Compression Schemes 

A popular variant of ZL78 is the ZLW compression scheme [24]. Here, the labels of compression
elements are not explicitly encoded, but are defined to be the first character of the next phrase.
Hence, ZLW does not offer an asymptotically better compression ratio over ZL78 but gives a better
practical performance. The ZLW scheme is implemented in the UNIX program compress. From an
algorithmic viewpoint ZLW is more difficult to handle in a space-efficient manner since labels are
not explicitly stored with the compression elements as in ZL78. However, if $\Omega(n)$ space is available
then we can simply construct the dictionary trie. This gives constant time access to the label of
a compression element and therefore ZL78 and ZLW become "equivalent". This is the reason why
Theorem 1 holds only for ZL78 when space is $o(n)$ but for both when the space is $\Omega(n)$.

Another well-known variant is the ZL77 compression scheme [25]. Unlike ZL78 and ZLW, phrases
in the ZL77 scheme can be any substring of text that has already been processed. This makes
searching much more difficult and none of the known techniques for ZL78 and ZLW seems to be
applicable. The only known algorithm for pattern matching on ZL77 compressed text is due to
Farach and Thorup [9] who gave an algorithm for the exact string matching problem.

3 Approximate String Matching 

In this section we consider the compressed approximate string matching problem. Before presenting 
our algorithm we need a few definitions and properties of approximate string matching. 

Let $A$ and $B$ be strings. Define the edit distance between $A$ and $B$, $\gamma(A,B)$, to be the minimum
number of insertions, deletions, and substitutions needed to transform $A$ to $B$. We say that
$j \in [1, |S|]$ is a match with error at most $k$ of $A$ in a string $S$ if there is an $i \in [1,j]$ such that
$\gamma(A, S[i,j]) \le k$. Whenever $k$ is clear from the context we simply call $j$ a match. All positions $i$
satisfying the above property are called a start of the match $j$. The set of all matches of $A$ in $S$ is
denoted $\Gamma(A,S)$. We need the following well-known property of approximate matches.

Proposition 1 Any match $j$ of $A$ in $S$ with at most $k$ errors must start in the interval
$[\max(1, j - |A| + 1 - k), \min(|S|, j - |A| + 1 + k)]$.

Proof. Let $l$ be the length of a substring $B$ matching $A$ and ending at $j$. If the match starts
outside the interval then either $l < |A| - k$ or $l > |A| + k$. In these cases, more than $k$ deletions or
$k$ insertions, respectively, are needed to transform $B$ to $A$. □



3.1 Searching for Matches 

Let $P$ be a string of length $m$ and let $k$ be an error threshold. To avoid trivial cases we assume
that $k < m$. Given a compressed string $Z = z_0 z_1 \cdots z_n$ representing a string $Q$ of length $u$ we show
how to find $\Gamma(P,Q)$ efficiently.

Let $l_i = |\mathrm{phrase}(z_i)|$, let $u_0 = 1$, and let $u_i = u_{i-1} + l_{i-1}$, for $1 \le i \le n$, i.e., $l_i$ is the length of
the $i$th phrase and $u_i$ is the starting position in $Q$ of the $i$th phrase. We process $Z$ from left-to-right
and at the $i$th step we find all matches in $[u_i, u_i + l_i - 1]$. Matches in this interval can be either
internal or overlapping (or both). A match $j$ in $[u_i, u_i + l_i - 1]$ is internal if it has a starting point
in $[u_i, u_i + l_i - 1]$ and overlapping if it has a starting point in $[1, u_i - 1]$. To find all matches we will
compute the following information for $z_i$.

• The start position, $u_i$, and length, $l_i$, of $\mathrm{phrase}(z_i)$.

• The relevant prefix, $\mathrm{rpre}(z_i)$, and the relevant suffix, $\mathrm{rsuf}(z_i)$, where

$$\mathrm{rpre}(z_i) = Q[u_i, \min(u_i + m + k - 1, u_i + l_i - 1)],$$
$$\mathrm{rsuf}(z_i) = Q[\max(1, u_i + l_i - m - k), u_i + l_i - 1].$$

In other words, $\mathrm{rpre}(z_i)$ is the largest prefix of length at most $m + k$ of $\mathrm{phrase}(z_i)$ and $\mathrm{rsuf}(z_i)$
is the substring of length $m + k$ ending at $u_i + l_i - 1$. For an example see Fig. 2.

Figure 2: The relevant prefix and the relevant suffix of two phrases in $Q$. Here, $|\mathrm{phrase}(z_{i-1})| <
m + k$ and therefore $\mathrm{rsuf}(z_{i-1})$ overlaps with previous phrases.

• The match sets $M_I(z_i)$ and $M_O(z_i)$, where

$$M_I(z_i) = \Gamma(P, \mathrm{phrase}(z_i)),$$
$$M_O(z_i) = \Gamma(P, \mathrm{rsuf}(z_{i-1}) \cdot \mathrm{rpre}(z_i)).$$

We assume that both sets are represented as sorted lists in increasing order.

We call the above information the description of $z_i$. In the next section we show how to
efficiently compute descriptions. For now, assume that we are given the description of $z_i$. Then,
the set of matches in $[u_i, u_i + l_i - 1]$ is reported as the set

$$M(z_i) = \{j + u_i - 1 \mid j \in M_I(z_i)\} \cup \left(\{j + u_i - 1 - |\mathrm{rsuf}(z_{i-1})| \mid j \in M_O(z_i)\} \cap [u_i, u_i + l_i - 1]\right).$$

We argue that this is the correct set. Since $\mathrm{phrase}(z_i) = Q[u_i, u_i + l_i - 1]$ we have that

$$j \in M_I(z_i) \iff j + u_i - 1 \in \Gamma(P, Q[u_i, u_i + l_i - 1]).$$

Hence, the set $\{j + u_i - 1 \mid j \in M_I(z_i)\}$ is the set of all internal matches. Similarly,
$\mathrm{rsuf}(z_{i-1}) \cdot \mathrm{rpre}(z_i) = Q[u_i - |\mathrm{rsuf}(z_{i-1})|, u_i + |\mathrm{rpre}(z_i)| - 1]$ and therefore

$$j \in M_O(z_i) \iff j + u_i - 1 - |\mathrm{rsuf}(z_{i-1})| \in \Gamma(P, Q[u_i - |\mathrm{rsuf}(z_{i-1})|, u_i + |\mathrm{rpre}(z_i)| - 1]).$$

By Proposition 1 any overlapping match must start at a position within the interval
$[\max(1, u_i - m + 1 - k), u_i]$. Hence, $\{j + u_i - 1 - |\mathrm{rsuf}(z_{i-1})| \mid j \in M_O(z_i)\}$ includes all overlapping matches in
$[u_i, u_i + l_i - 1]$. Taking the intersection with $[u_i, u_i + l_i - 1]$ and the union with the internal matches
it follows that the set $M(z_i)$ is precisely the set of matches in $[u_i, u_i + l_i - 1]$. For an example see
Fig. 3.

Next we consider the complexity of computing the matches. To do this we first bound the size
of the $M_I$ and $M_O$ sets. Since the length of any relevant suffix and relevant prefix is at most $m + k$,
we have that $|M_O(z_i)| \le 2(m + k) < 4m$, and therefore the total size of the $M_O$ sets is at most
$O(nm)$. Each element in the sets $M_I(z_0), \ldots, M_I(z_n)$ corresponds to a unique match. Thus, the
total size of the $M_I$ sets is at most $occ$, where $occ$ is the total number of matches. Since both sets
are represented as sorted lists the total time to compute the matches for all compression elements
is $O(nm + occ)$.

$Q$ = ananasbananer, $P$ = base, $Z = (0,a)(0,n)(1,n)(1,s)(0,b)(3,a)(2,e)(0,r)$

Figure 3: Example of descriptions. $Z$ is the compressed string representing $Q$. We are looking for all
matches of the pattern $P$ with error threshold $k = 2$ in $Z$. The set of matches is $\{6, 7, 8, 9, 10, 12\}$.
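
The way internal and overlapping matches combine can be checked with a small Python sketch. It is deliberately not space-efficient: as a simplifying assumption it expands every phrase and keeps the decoded text to obtain the relevant suffix, and it uses a Sellers-style dynamic program as the uncompressed matcher. The names `gamma` and `phrase_matches` are hypothetical.

```python
def gamma(A, S, k):
    """Sorted 1-based end positions j of matches of A in S with at most k errors."""
    col = list(range(len(A) + 1))            # DP column for the empty text prefix
    out = []
    for j, c in enumerate(S, start=1):
        diag, col[0] = col[0], 0             # a match may start at any position
        for i in range(1, len(A) + 1):
            cur = min(diag + (A[i - 1] != c), col[i] + 1, col[i - 1] + 1)
            diag, col[i] = col[i], cur
        if col[-1] <= k:
            out.append(j)
    return out


def phrase_matches(Z, P, k):
    """Report matches phrase by phrase from M_I and M_O (assumes k < len(P))."""
    m = len(P)
    phrases = [""]
    for ref, label in Z[1:]:
        phrases.append(phrases[ref] + label)
    result, u, prev_rsuf, text = [], 1, "", ""
    for ph in phrases[1:]:
        l = len(ph)
        rpre = ph[:m + k]                                  # relevant prefix
        M_I = gamma(P, ph, k)                              # internal matches
        M_O = gamma(P, prev_rsuf + rpre, k)                # candidates across the boundary
        internal = {j + u - 1 for j in M_I}
        overlap = {j + u - 1 - len(prev_rsuf) for j in M_O}
        overlap = {j for j in overlap if u <= j <= u + l - 1}
        result.extend(sorted(internal | overlap))
        text += ph
        prev_rsuf = text[-(m + k):]                        # relevant suffix of this phrase
        u += l
    return result


Z = [(0, ""), (0, "a"), (0, "n"), (1, "n"), (1, "s"),
     (0, "b"), (3, "a"), (2, "e"), (0, "r")]
```

Running `phrase_matches(Z, "base", 2)` on the example of Figure 3 gives the same positions as running `gamma` directly on the uncompressed string.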



3.2 Computing Descriptions 

Next we show how to efficiently compute the descriptions. Let $1 \le \tau \le n$ be a parameter. Initially,
we compute a subset $C$ of the elements in $Z$ according to Lemma 1 with parameter $\tau$. For each
element $z_j \in C$ we store $l_j$, that is, the length of $\mathrm{phrase}(z_j)$. If $l_j > m + k$ we also store the index
of the ancestor $x$ of $z_j$ of depth $m + k$. This information can easily be computed while constructing
$C$ within the same time and space bounds, i.e., using $O(n\tau)$ time and $O(n/\tau)$ space.

Descriptions are computed from left-to-right as follows. Initially, set $l_0 = 0$, $u_0 = 1$, $\mathrm{rpre}(z_0) = \epsilon$,
$\mathrm{rsuf}(z_0) = \epsilon$, $M_I(z_0) = \emptyset$, and $M_O(z_0) = \emptyset$. To compute the description of $z_i$, $1 \le i \le n$, first follow
the path $p$ of references until we encounter an element $z_j \in C$. Using the information stored at $z_j$
we set $l_i := |p| + l_j$ and $u_i := u_{i-1} + l_{i-1}$. By Lemma 1(ii) the distance to $z_j$ is at most $2\tau$ and
therefore $l_i$ and $u_i$ can be computed in $O(\tau)$ time given the description of $z_{i-1}$.

To compute $\mathrm{rpre}(z_i)$ we compute the label of the path from $z_0$ towards $z_i$ of length $\min(m + k, l_i)$.
There are two cases to consider: If $l_i \le m + k$ we simply compute the label of the path from $z_i$
to $z_0$ and let $\mathrm{rpre}(z_i)$ be the reverse of this string. Otherwise ($l_i > m + k$), we use the "shortcut"
stored at $z_j$ to find the ancestor $z_h$ of distance $m + k$ to $z_0$. The reverse of the label of the path
from $z_h$ to $z_0$ is then $\mathrm{rpre}(z_i)$. Hence, $\mathrm{rpre}(z_i)$ is computed in $O(m + k + \tau) = O(m + \tau)$ time.

The string $\mathrm{rsuf}(z_i)$ may be divided over several phrases and we therefore recursively follow
paths towards the root until we have computed the entire string. It is easy to see that the following
algorithm correctly decodes the desired substring of length $\min(m + k, u_i + l_i - 1)$ ending at position
$u_i + l_i - 1$.

1. Initially, set $l := \min(m + k, u_i + l_i - 1)$, $t := i$, and $s := \epsilon$.

2. Compute the path $p$ of references from $z_t$ of length $r = \min(l, \mathrm{depth}(z_t))$ and set
$s := s \cdot \mathrm{label}(p)$.

3. If $r < l$ set $l := l - r$, $t := t - 1$, and repeat step 2.

4. Return $\mathrm{rsuf}(z_i)$ as the reverse of $s$.
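
Steps 1 to 4 can be sketched in Python as follows. The sketch assumes $Z$ is a list of (reference, label) pairs with $z_0$ at index 0 and, for simplicity, precomputes all depths in an array (which uses $O(n)$ space, unlike the description-based computation in the text). The names `depths` and `rsuf` are hypothetical.

```python
def depths(Z):
    """depth[i] = |phrase(z_i)|; references always point to earlier elements."""
    d = [0] * len(Z)
    for i in range(1, len(Z)):
        d[i] = d[Z[i][0]] + 1
    return d


def rsuf(Z, depth, i, m, k):
    """Relevant suffix of z_i: the last min(m + k, u_i + l_i - 1) characters ending at phrase i."""
    end = sum(depth[1:i + 1])               # u_i + l_i - 1, the end position of phrase i
    l, t, s = min(m + k, end), i, []        # step 1
    while l > 0:
        r = min(l, depth[t])                # step 2: walk r edges up from z_t
        x = t
        for _ in range(r):
            s.append(Z[x][1])
            x = Z[x][0]
        l -= r                              # step 3: continue in the previous phrase
        t -= 1
    return "".join(reversed(s))             # step 4


Z = [(0, ""), (0, "a"), (0, "n"), (1, "n"), (1, "s"),
     (0, "b"), (3, "a"), (2, "e"), (0, "r")]
```

For the example string ananasbananer with $m + k = 6$, the relevant suffix of $z_6$ (the phrase ana ending at position 10) spans the phrases as, b, and ana.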

Since the length of $\mathrm{rsuf}(z_i)$ is at most $m + k$, the algorithm finds it in $O(m + k) = O(m)$ time.

The match sets $M_I$ and $M_O$ are computed as follows. Let $t(m,u,k)$ and $s(m,u,k)$ denote the
time and space to compute $\Gamma(A,B)$ with error threshold $k$ for strings $A$ and $B$ of lengths $m$ and
$u$, respectively. Since $|\mathrm{rsuf}(z_{i-1}) \cdot \mathrm{rpre}(z_i)| \le 2m + 2k$ it follows that $M_O(z_i)$ can be computed in
$t(m, 2m + 2k, k)$ time and $s(m, 2m + 2k, k)$ space. Since $M_I(z_i) = \Gamma(P, \mathrm{phrase}(z_i))$ we have that
$j \in M_I(z_i)$ if and only if $j \in M_I(\mathrm{reference}(z_i))$ or $j = l_i$. By Proposition 1 any match ending in $l_i$
must start within $[\max(1, l_i - m + 1 - k), \min(l_i, l_i - m + 1 + k)]$. Hence, there is a match ending in
$l_i$ if and only if $l_i \in \Gamma(P, \mathrm{rsuf}'(z_i))$ where $\mathrm{rsuf}'(z_i)$ is the suffix of $\mathrm{phrase}(z_i)$ of length $\min(m + k, l_i)$.
Note that $\mathrm{rsuf}'(z_i)$ is a suffix of $\mathrm{rsuf}(z_i)$ and we can therefore trivially compute it in $O(m + k)$ time.
Thus,

$$M_I(z_i) = M_I(\mathrm{reference}(z_i)) \cup \{l_i \mid l_i \in \Gamma(P, \mathrm{rsuf}'(z_i))\}.$$

Computing $\Gamma(P, \mathrm{rsuf}'(z_i))$ uses $t(m, m + k, k)$ time and $s(m, m + k, k)$ space. Subsequently,
constructing $M_I(z_i)$ takes $O(|M_I(z_i)|)$ time and space. Recall that the elements in the $M_I$ sets correspond
uniquely to matches in $Q$ and therefore the total size of the sets is $occ$. Therefore, using dynamic
perfect hashing [8] on pointers to non-empty $M_I$ sets we can store these using $O(occ)$ space in total.

3.3 Analysis 

Finally, we can put the pieces together to obtain the final algorithm. The preprocessing uses
$O(n\tau)$ expected time and $O(n/\tau)$ space. The total time to compute all descriptions and report
occurrences is expected $O(n(\tau + m + t(m, 2m + 2k, k)) + occ)$. The description for $z_i$, except for
$M_I(z_i)$, depends solely on the description of $z_{i-1}$. Hence, we can discard the description of $z_{i-1}$,
except for $M_I(z_{i-1})$, after processing $z_i$ and reuse the space. It follows that the total space used is
$O(n/\tau + m + s(m, 2m + 2k, k) + occ)$. This completes the proof of Theorem 1. Note that if we use
$\Omega(n)$ space we can explicitly construct the dictionary. In this case hashing is not needed and the
bounds also hold for the ZLW compression scheme.

4 Regular Expression Matching 

4.1 Regular Expressions and Finite Automata 

First we briefly review the classical concepts used in the paper. For more details see, e.g., Aho et
al. [1]. The set of regular expressions over $\Sigma$ is defined recursively as follows: A character $\alpha \in \Sigma$ is
a regular expression, and if $S$ and $T$ are regular expressions then so are the concatenation, $(S) \cdot (T)$,
the union, $(S)|(T)$, and the star, $(S)^*$. The language $L(R)$ generated by $R$ is defined as follows:
$L(\alpha) = \{\alpha\}$, $L(S \cdot T) = L(S) \cdot L(T)$, that is, any string formed by the concatenation of a string in
$L(S)$ with a string in $L(T)$, $L(S|T) = L(S) \cup L(T)$, and $L(S^*) = \bigcup_{i \ge 0} L(S)^i$, where $L(S)^0 = \{\epsilon\}$
and $L(S)^i = L(S)^{i-1} \cdot L(S)$, for $i > 0$.

A finite automaton is a tuple $A = (V, E, \Sigma, \theta, \Phi)$, where $V$ is a set of nodes called states, $E$ is a set
of directed edges between states called transitions each labeled by a character from $\Sigma \cup \{\epsilon\}$, $\theta \in V$
is a start state, and $\Phi \subseteq V$ is a set of final states. In short, $A$ is an edge-labeled directed graph with
a special start node and a set of accepting nodes. $A$ is a deterministic finite automaton (DFA) if $A$
does not contain any $\epsilon$-transitions, and all outgoing transitions of any state have different labels.
Otherwise, $A$ is a non-deterministic automaton (NFA).

The label of a path $p$ in $A$ is the concatenation of labels on the transitions in $p$. For a subset
$S$ of states in $A$ and character $\alpha \in \Sigma \cup \{\epsilon\}$, define the transition map, $\delta(S, \alpha)$, as the set of states
reachable from $S$ via a path labeled $\alpha$. Computing the set $\delta(S, \alpha)$ is called a state-set transition.
We extend $\delta$ to strings by defining $\delta(S, \alpha \cdot B) = \delta(\delta(S, \alpha), B)$, for any string $B$ and character $\alpha \in \Sigma$.
We say that $A$ accepts the string $B$ if $\delta(\{\theta\}, B) \cap \Phi \ne \emptyset$. Otherwise $A$ rejects $B$. As in the previous
section, we say that $j \in [1, |B|]$ is a match iff there is an $i \in [1, j]$ such that $A$ accepts $B[i,j]$. The
set of all matches is denoted $\Delta(A, B)$.

Given a regular expression $R$, an NFA $A$ accepting precisely the strings in $L(R)$ can be obtained
by several classic methods [10,16,23]. In particular, Thompson [23] gave a simple well-known
construction which we will refer to as a Thompson NFA (TNFA). A TNFA $A$ for $R$ has at most $2m$
states, at most $4m$ transitions, and can be computed in $O(m)$ time. Hence, a state-set transition can
be computed in $O(m)$ time using a breadth-first search of $A$ and therefore we can test acceptance
of $Q$ in $O(um)$ time and $O(m)$ space. This solution is easily adapted to find all matches in the
same complexity by adding the start state to each of the computed state-sets immediately before
computing the next. Formally, $\delta(S, \alpha \cdot B) = \delta(\delta(S \cup \{\theta\}, \alpha), B)$, for any string $B$ and character
$\alpha \in \Sigma$. A match then occurs at position $j$ if $\delta(\{\theta\}, Q[1,j]) \cap \Phi \ne \emptyset$.
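
The state-set simulation can be sketched in Python as below. The automaton is a small hand-written NFA for the expression $a(n)^*a$ rather than the output of Thompson's construction, the dictionary encoding of transitions is an assumption made for this illustration, and a depth-first search replaces the breadth-first search mentioned above (either visits each transition at most once).

```python
EPS = ""  # label used for epsilon-transitions in this sketch


def closure(trans, S):
    """All states reachable from S using epsilon-transitions only."""
    stack, seen = list(S), set(S)
    while stack:
        s = stack.pop()
        for t in trans.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def delta(trans, S, a):
    """State-set transition: states reachable from S via a path labeled a."""
    moved = set()
    for s in closure(trans, S):
        moved |= set(trans.get((s, a), ()))
    return closure(trans, moved)


def nfa_matches(trans, start, finals, Q):
    """1-based end positions of substrings of Q accepted by the automaton."""
    S, out = set(), []
    for j, a in enumerate(Q, start=1):
        S = delta(trans, S | {start}, a)   # re-add the start state before each step
        if S & finals:
            out.append(j)
    return out


# Hypothetical 4-state automaton for a(n)*a: start state 0, final state 3.
TRANS = {(0, "a"): {1}, (1, EPS): {2}, (2, "n"): {1}, (2, "a"): {3}}
```

On the text ananas the substrings ana ending at positions 3 and 5 are the only ones accepted, so `nfa_matches(TRANS, 0, {3}, "ananas")` reports exactly those two positions.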

4.2 Searching for Matches 

Let $A = (V, E, \Sigma, \theta, \Phi)$ be a TNFA with $m$ states. Given a compressed string $Z = z_1 \cdots z_n$
representing a string $Q$ of length $u$ we show how to find $\Delta(A, Q)$ efficiently. As in the previous
section let $l_i$ and $u_i$, $0 \le i \le n$, be the length and start position of $\mathrm{phrase}(z_i)$. We process $Z$ from
left-to-right and compute a description for $z_i$ consisting of the following information.

• The integers $l_i$ and $u_i$.

• The state-set $S_{u_i} = \delta(\{\theta\}, Q[1, u_i + l_i - 1])$.

• For each state $s$ of $A$ the compression element $\mathrm{lastmatch}(s, z_i) = x$, where $x$ is the ancestor
of $z_i$ of maximum depth such that $\delta(\{s\}, \mathrm{phrase}(x)) \cap \Phi \ne \emptyset$. If there is no ancestor that
satisfies this, then $\mathrm{lastmatch}(s, z_i) = \bot$.

An example description is shown in Fig. 4. The total size of the description for $z_i$ is $O(m)$ and
therefore the space for all descriptions is $O(nm)$. In the next section we will show how to compute
the descriptions. Assume for now that we have processed $z_0, \ldots, z_{i-1}$. We show how to find
the matches within $[u_i, u_i + l_i - 1]$. Given a state $s$ define $M(s, z_i) = \{x_1, \ldots, x_k\}$, where
$x_1 = \mathrm{lastmatch}(s, z_i)$, $x_j = \mathrm{lastmatch}(s, \mathrm{parent}(x_{j-1}))$, $1 < j \le k$, and $\mathrm{lastmatch}(s, \mathrm{parent}(x_k)) = \bot$, i.e.,
$x_1, \ldots, x_k$ is the sequence of ancestors of $z_i$ obtained by recursively following lastmatch pointers.
By the definition of lastmatch and $M(s, z_i)$ it follows that $M(s, z_i)$ is the set of ancestors $x$ of $z_i$
such that $\delta(\{s\}, \mathrm{phrase}(x)) \cap \Phi \ne \emptyset$. Hence, if $s \in S_{u_{i-1}}$ then each element $x \in M(s, z_i)$ represents a match,
namely, $u_i + \mathrm{depth}(x) - 1$. Each match may occur for each of the $|S_{u_{i-1}}|$ states and to avoid
reporting duplicate matches we use a priority queue to merge the sets $M(s, z_i)$ for all $s \in S_{u_{i-1}}$,
while generating these sets in parallel. A similar approach is used in [19]. This takes $O(\log m)$ time
per match. Since each match can be duplicated at most $|S_{u_{i-1}}| = O(m)$ times the total time for
reporting matches is $O(occ \cdot m \log m)$.

4.3 Computing Descriptions 

Next we show how to compute descriptions efficiently. Let $1 \le \tau \le n$ be a parameter. Initially,
compute a set $C$ of compression elements according to Lemma 1 with parameter $\tau$. For each element
$z_j \in C$ we store $l_j$ and the transition sets $\delta(\{s\}, \mathrm{phrase}(z_j))$ for each state $s$ in $A$. Each transition set
uses $O(m)$ space and therefore the total space used for $z_j$ is $O(m^2)$. During the construction of $C$
we compute each of the transition sets by following the path of references to the nearest element
$y \in C$ and computing state-set transitions from $y$ to $z_j$. By Lemma 1(ii) the distance to $y$ is at most
$2\tau$ and therefore all of the $m$ transition sets can be computed in $O(\tau m^2)$ time. Since $|C| = O(n/\tau)$
the total preprocessing time is $O(n/\tau \cdot \tau m^2) = O(nm^2)$ and the total space is $O(n/\tau \cdot m^2)$.

Figure 4: The compressed string $Z$ representing $Q$ and the corresponding dictionary trie $D$. The
TNFA $A$ for the regular expression $R$ and the corresponding state-sets $S_{u_i}$ are given. The
lastmatch pointers are as follows: $\mathrm{lastmatch}(s_7, z_i) = \{z_0\}$ for $i = 0, 1, \ldots, 8$, $\mathrm{lastmatch}(s_2, z_i) =
\mathrm{lastmatch}(s_4, z_i) = \mathrm{lastmatch}(s_5, z_i) = \{z_3\}$ for $i = 3, 6$, and $\mathrm{lastmatch}(s_6, z_i) = \{z_2\}$ for $i = 2, 7$.
All other lastmatch pointers are $\bot$. Using the description we can find the matches: Since $s_2 \in S_{u_5}$
the element $z_3 \in M(s_2, z_6)$ represents the match $u_6 + \mathrm{depth}(z_3) - 1 = 9$. The other matches can
be found similarly.

The descriptions can now be computed as follows. The integers $l_i$ and $u_i$ can be computed as
before in $O(\tau)$ time. All lastmatch pointers for all compression elements can easily be obtained
while computing the transition sets. Hence, we only show how to compute the state-set values.
First, let $S_{u_0} := \{\theta\}$. To compute $S_{u_i}$ from $S_{u_{i-1}}$ we compute the path $p$ to $z_i$ from the nearest
element $y \in C$. Let $p'$ be the path from $z_0$ to $y$. Since $\mathrm{phrase}(z_i) = \mathrm{label}(p') \cdot \mathrm{label}(p)$ we can
compute $S_{u_i} = \delta(S_{u_{i-1}}, \mathrm{phrase}(z_i))$ in two steps as follows. First compute the set

$$S' = \bigcup_{s \in S_{u_{i-1}}} \delta(\{s\}, \mathrm{phrase}(y)). \qquad (1)$$

Since $y \in C$ we know the transition sets $\delta(\{s\}, \mathrm{phrase}(y))$ and we can therefore compute the union
in $O(m^2)$ time. Secondly, we compute $S_{u_i}$ as the set $\delta(S', \mathrm{label}(p))$. Since the distance to $y$ is
at most $2\tau$ this step uses $O(\tau m)$ time. Hence, all the state-sets $S_{u_0}, \ldots, S_{u_n}$ can be computed in
$O(nm(m + \tau))$ time.
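
The two-step computation relies on the fact that a state-set transition distributes over unions of state sets. A minimal Python check of this identity is sketched below, under the same assumptions as a plain NFA simulation: transitions are stored in a dictionary, the automaton is a hypothetical one for $a(n)^*a$, and the per-state sets for $\mathrm{phrase}(y)$ are recomputed on the fly where the text would have them stored at $y \in C$.

```python
EPS = ""  # label used for epsilon-transitions in this sketch


def closure(trans, S):
    """All states reachable from S using epsilon-transitions only."""
    stack, seen = list(S), set(S)
    while stack:
        s = stack.pop()
        for t in trans.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def delta(trans, S, text):
    """delta extended to strings: apply one state-set transition per character."""
    cur = closure(trans, S)
    for a in text:
        moved = set()
        for s in cur:
            moved |= set(trans.get((s, a), ()))
        cur = closure(trans, moved)
    return cur


def two_step(trans, states, S, phrase_y, label_p):
    """Compute delta(S, phrase_y + label_p) via equation (1)."""
    # In the text these m sets are precomputed and stored with y in C.
    stored = {s: delta(trans, {s}, phrase_y) for s in states}
    S_prime = set().union(*(stored[s] for s in S))     # the union in (1)
    return delta(trans, S_prime, label_p)              # finish with label(p)


TRANS = {(0, "a"): {1}, (1, EPS): {2}, (2, "n"): {1}, (2, "a"): {3}}
STATES = {0, 1, 2, 3}
```

For instance, starting from the state set $\{0, 1\}$ with $\mathrm{phrase}(y)$ = an and $\mathrm{label}(p)$ = a, both the direct computation over ana and the two-step computation end in the same state set.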

4.4 Analysis

Combining it all, we have an algorithm using $O(nm(m + \tau) + occ \cdot m \log m)$ time and $O(nm + nm^2/\tau)$
space. Note that since we are using $\Omega(n)$ space, hashing is not needed and the algorithm works for
ZLW as well. In summary, this completes the proof of Theorem 2.

4.5 Exploiting Word-level Parallelism 

If we use the word-parallelism inherent in the word-RAM model, the algorithm of Navarro [19]
uses $O(\lceil m/w \rceil (2^m + nm) + occ \cdot m \log m)$ time and $O(\lceil m/w \rceil (2^m + nm))$ space, where $w$ is the
number of bits in a word of memory and space is counted as the number of words used. The key
idea in Navarro's algorithm is to compactly encode state-sets in bit strings stored in $O(\lceil m/w \rceil)$
words. Using a DFA based on a Glushkov automaton [10] to quickly compute state-set transitions,
and bitwise OR and AND operations to compute unions and intersections among state-sets, it is
possible to obtain the above result. The $O(\lceil m/w \rceil 2^m)$ term in the above bounds is the time and
space used to construct the DFA.

A similar idea can be used to improve Theorem 2. However, since our solution is based on
Thompson's automaton we do not need to construct a DFA. More precisely, using the state-set
encoding of TNFAs given in [6,17] a state-set transition can be computed in $O(\lceil m/\log n \rceil)$ time
after $O(n)$ time and space preprocessing. Since state-sets are encoded as bit strings each transition
set uses $\lceil m/\log n \rceil$ space and the union in (1) can be computed in $O(m \lceil m/\log n \rceil)$ time using a
bitwise OR operation. As $n \ge \sqrt{u}$ in ZL78 and ZLW, we have that $\log n \ge \frac{1}{2} \log u$ and therefore
Theorem 2 can be improved by roughly a factor $\log u$. Specifically, we get an algorithm using
$O(n \lceil m/\log u \rceil (m + \tau) + occ \cdot m \log m)$ time and $O(nm \lceil m/\log n \rceil /\tau + nm)$ space.
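
The bit-string encoding of state-sets can be illustrated with Python integers used as bit vectors. This only sketches the bitwise OR used for unions; it does not implement the TNFA transition technique of [6,17], and Python integers are arbitrary-precision rather than $w$-bit words. The helper names are hypothetical.

```python
def encode(states):
    """Encode a set of state numbers as an integer with one bit per state."""
    bits = 0
    for s in states:
        bits |= 1 << s
    return bits


def decode(bits):
    """Inverse of encode: the set of positions whose bit is 1."""
    return {i for i in range(bits.bit_length()) if (bits >> i) & 1}


def union_of_transition_sets(S_bits, stored_bits):
    """OR together stored_bits[s] for every state s whose bit is set in S_bits."""
    out, s = 0, 0
    while S_bits:
        if S_bits & 1:
            out |= stored_bits[s]   # one bitwise OR per state in the set
        S_bits >>= 1
        s += 1
    return out


# Hypothetical stored transition sets for a 4-state automaton.
STORED = [encode({1, 2}), encode(set()), encode({3}), encode(set())]
```

Here `union_of_transition_sets(encode({0, 2}), STORED)` combines the sets stored for states 0 and 2 with two OR operations.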